0. Origin of the Neural Networks : the perceptron

A neurology primer :

A neuron is a cell which can be electrically excited. It processes and transmits information in the brain. Typically, a neuron is made of a cell body, dendrites, and an axon. Networks of neurons are interconnected in the brain by membranes called synapses. Synaptic signals from other neurons are received at the dendrites while signals to other neurons are transmitted by the axon.

Perceptrons were first introduced in the 1950s and 1960s as an artificial neuron.

It is easy to see the analogy between the two : in an artificial neuron (perceptron), several inputs are combined (mimicking the role of the dendrites) to produce an output (mimicking the role of the axon).

A simple way to combine the inputs into a single output is to compute a weighted sum. The value of this sum with respect to a given threshold gives the output.

\begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right. \end{eqnarray}

We an actually replace the threshold by introducing what we call a bias term :

$$\begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \end{eqnarray}$$

This will prove handy in the following.

It is quite easy to visualise the perceptron as a way of weighing information. Assuming we want to answer the question : will it rain tomorrow ?. We can gather some data about the weather today :

Temperature today
Wind direction
Weather in a 100 km radius

The perceptron combines all this information with varying importance (the weights) and eventually fires a yes or no answer.

The weights can be determined using existing data and minimising the prediction error on this data with gradient descent.

Why stop at a single neuron ? After all, the brain is made up of about 100 billion neurons! We can combine several perceptrons together (see fig) in the hope of creating greater levels of abstractions which will allow for more complex decisions.

In the following, we will now see how the weights and biases in a complicated network can be tuned to adapt automatically to any data.

1. From the perceptron to the sigmoid neuron

Recall that perceptrons take decisions on hard thresholding. This makes it difficult to tune our network. Let's give an example. Sat we want to improve the network's performance on misclassified data points. We can make a small change on one of the weight and see if classification improves. However, with hard thresholding, small changes may completely alter the behaviour of the network as the output may brutally fall on the other side of the threshold. This problem is alleviated by the use of a continuous activation function, like the sigmoid (soft thresholding).

With the sigmoid, the output of the neural network becomes :

$$\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}. \end{eqnarray}$$

The differences between hard and soft thresholding are illutrated below

In fact, any function with a shape similar to the sigmoid (a smooth transition from 0 to 1 between) would suit our need of soft thresholding. It just so happens that the sigmoid has properties which make it particularly attractive.

2. Architecture of the neural network

Now that we have the base unit of our network figured out, how do we build the full network ? There are arbitrarily many complicated ways of building a network (think feedback loops, arbitrary connections from one layer to another ...). We are going to focus on a subclass of networks, which will prove handy to parameterize : feedforward fully connected neural networks.

Such networks are organised in layers, each unit of each layer being connected to each unit of the following layer.

3. Training the neural network : backpropagation

From now on, we'll work with the MNIST data set to provide an illustration to our discussion. This dataset is organised in the following way: It is split into two parts. There is a training part (60 000 images, represented as 28*28 arrays with the grayscale value of the relevant pixel) and a test part (10 000 similar images). Each image corresponds to a single handwritten digit.

Architecture :

We will use a simple design : 3 layers (inputs + one hidden layer + one output layer).

One such network is shown below, with 15 hidden layer cells

Notations :

x is the input, a 28*28 = 784 dimensional vector y is the output, a 10-dimensional vector such that (1,0,0,0,0,0,0,0,0,0) represents 0, (0,1,0,0,0,0,0,0,0,0) represents 1 and so on. a is the 10-dimensional output of the neural network

We will quantify the success of our model with the well known squared loss error :

$$\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \end{eqnarray}$$

The reason behind this choice is that the square loss function is smooth in the parameters of the networks (i.e. its weights and biases) which makes it easier to understand how small changes in the weights and biases may affect the classification performance.

Gradient descent :

In order to minimise the error $C$ identified above, we are going to use gradient descent.
In fact, we will make use of a slightly modified version of gradient descent to alleviate computing constraints. Indeed, vanilla gradient descent updates the gradient of $C$ by computing the derivatives with respect to all data points at each update. If we have a large training set, this may prove untractable.
That's why we'll implement stochastic batch gradient descent :

We will randomly select some data points: a mini batch
And we will compute the gradient update with respect to these points only
We continue sampling the training set randomly (without replacement) until it is exhausted. This completes an epoch of training.
We iterate over several epochs.

The size of the batch can vary, with a small size meaning fast computations at the cost of a loss of accuracy in the computation of the gradient.

Let us recapitulate : our goal is to tailor the weights $w$ and the biases $b$ of the neural network so that we minimise the cost $C$.
Gradient descent helps us achieve this goal by providing a way to progressively update all $w$ and $b$ to their ideal values.

Mathematically :

$$ \begin{eqnarray} w & \rightarrow & w' = w-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w} \\ b & \rightarrow & b' = b-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b}, \end{eqnarray}$$

where the sum is on a mini-batch.

Backpropagation

Let's move on to the more difficult bit.
The backpropagation algorithm provides us with an efficient way of computing the gradient of $C$. Let's see how.

Notation

$b^l_j$ is the bias of the jth neuron in the lth layer.
$a^l_j$ is the activation of the jth neuron in the lth layer.
$w^l_{jk}$ is the weight for the connection from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer.

Let us write down a first equation to write the output of a layer :

\begin{eqnarray} a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \end{eqnarray}

This equation expresses the architecture of our network: the sum is on all the neurons of the previous layer, thereby making our network fully connected.
We will use the following shorthand notation as well :

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$$

This immediately gives :

$$a^l = \sigma(z^l)$$

Computing the derivatives of $C$ :

Now that the notations are in place, recall that we want to compute the derivatives of $C$: $\hspace{2mm}\partial C / \partial b^l_j$ and $\partial C / \partial w^l_{jk}$.

a) Computing $\hspace{2mm}\partial C / \partial b^l_j$ :

Using the chain rule, we write :

\begin{equation} \frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial z^l_k}\frac{\partial z^l_k}{\partial b^l_j } \end{equation}

which reduces to :

\begin{equation} \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z_l^j} \equiv \delta^l_j \end{equation}

where we have introduced the intermediate quantity $\delta^l_j \equiv \frac{\partial C}{\partial z_l^j}$.

b) Computing $\delta^l_j $ :

Let's compute it in two times.

For the final layer $L$, using the chain rule,

\begin{eqnarray} \frac{\partial C}{\partial z^L_j} &= \sum_i \frac{\partial C}{\partial a^L_i}\frac{\partial a^L_i}{\partial z^L_{j} }\\ &= \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z^L_{j}} \\ & = \frac{\partial C}{\partial a^L_j} \sigma^{\prime}(z^L_j) \end{eqnarray}

For any other layer $l$, using the chain rule:

\begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \end{eqnarray}

note that :

\begin{eqnarray} z^{l+1}_k & = & \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k\\ & = & \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k \end{eqnarray}

hence :

\begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \end{eqnarray}

thus :

\begin{eqnarray} \delta^l_j & = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j) \\ \delta^l & = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber \end{eqnarray}

The last line expresses the result using the Hadamard product.

c) Computing $\partial C / \partial w^l_{jk}$ :

All that's left is to compute $\partial C / \partial w^l_{jk}$. Once again, we make use of the chain rule :

\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} & = \sum_i \frac{\partial C}{\partial z^l_i}\frac{\partial z^l_i}{\partial w^l_{jk} }\\ & = a^{l-1}_k \delta^l_j \end{eqnarray}

The backpropagation algorithm

The formulas we established guide us towards the backpropagation algorithm :

Initialise by setting $a^1$ as the input data
Feedforward : propagate by computing $z^l$ and $a^l$ for each leayer
Compute $\delta^L$.
Backpropagate : compute $\delta^l$ for all remaining layers
Use $\delta$ to compute $\hspace{2mm}\partial C / \partial b^l_j$ and $\partial C / \partial w^l_{jk}$.

4. Python implementation

The code below is an implementation of neural networks trained by backpropagation in pure python.



In [5]:

    
# Let's start by importing relevant packages :
import numpy as np
import os
import cPickle as pickle
import gzip
import subprocess



In [10]:

    
# Define a function to unzip the mnist data and load it
def get_mnist_data():
    # Unzip and load the data set
    f = gzip.open("../data/mnist.pkl.gz", "rb")
    train, val, test = pickle.load(f)
    f.close()
    return train, val, test

# Format the mnist target function
def format_mnist(y):
    """ Convert the 1D to 10D """

    # convert to 10 categeories
    y_new = np.zeros((10, 1))
    y_new[y] = 1
    return y_new
    
# Let's load and format the data for our neural network
train, test, valid = get_mnist_data()
training_inputs = [np.reshape(x, (784, 1)) for x in train[0]]
training_results = [format_mnist(y) for y in train[1]]
training_data = zip(training_inputs, training_results)
test_fm = []
for i in range(len(test[0])):
    test_fm.append((test[0][i], test[1][i]))
test = test_fm



In [11]:

    
# Lets check the dimensions of our data
print len(training_data), len(test)









    



50000 10000



In [12]:

    
# Neural Network code per se
# All documentation is inline


class ActivationFunction(object):

    """ Activation function class
    
    We define the activation function and its derivative.
    Only one choice : the sigmoid.
    Feel free to add more !
    
    """

    @staticmethod
    def sigmoid():
        return lambda x: 1. / (1 + np.exp(-x))

    @staticmethod
    def dsigmoid():
        return lambda x: 1. / (1 + np.exp(-x)) * (1  - 1. / (1 + np.exp(-x)))

class CostFunction(object):

    """ Cost function class
    
    We define one function : the mse
    Feel free to add your own (like the cross entropy)
    """

    @staticmethod
    def mse():
        return lambda y,a: 0.5 * np.power(y-a, 2)

    @staticmethod
    def dmse():
        return lambda y,a: a - y

class BasicNeuralNet(object):


    """
    
    Neural network class
    Implemented in python
    
    Main parts :
    
    __init__ = initialisation of the neural networks
    the weights and biases are randomly initialised
    
    _forward_pass() = operates the feedforward pass discussed above
    _backward_pass() = operates the backpropagation pass discussed above
    _update_gradient() = computes dCdw and dCdb for a mini batch as explained above
    
    fit_SGD() = fit the neural network to the data
                it loops over epochs, create mini batches of the data
                and minimises the cost function by gradient descent
    score() = evaluate the classification performance on specified test_data
    
    """
    def __init__(
            self,
            sizes,
            lmbda = 0,
            actfuncname="sigmoid",
            costfuncname="mse",
            verbose=False):
        self._nlayers = len(sizes)
        self._sizes = sizes
        self._lmbda = lmbda
        # Random initialisation of biases and weights.
        #For the weights, use gaussian with std = sqrt(# of weights connecting to a neuron)
        # So that by the CLT, their sum is gaussian with std = 1
        # Add [0] for clearer indexing
        self._biases = [np.array([0])] + [np.random.randn(size, 1)
                                          for size in self._sizes[1:]]
        self._weights = [np.array([0])] + [np.random.randn(self._sizes[i], self._sizes[i - 1])/ \
                                           np.sqrt(self._sizes[i-1])
                                           for i in range(1, self._nlayers)]

        # Initialisation of z
        self._z = [np.array([0])] + [np.zeros((size, 1))
                                     for size in self._sizes[1:]]

        # Activation function
        self._actfuncname = actfuncname
        if self._actfuncname == "sigmoid":
            self._actfunc = ActivationFunction.sigmoid()
            self._dactfunc = ActivationFunction.dsigmoid()

        # Cost function
        self._costfuncname = costfuncname
        if self._costfuncname == "mse":
            self._costfunc = CostFunction.mse()
            self._dcostfunc = CostFunction.dmse()


    def _forward_pass(self, x):
        # Initialisation of activation matrix
        self._a = [x]
        for layer in range(1, self._nlayers):
            self._z[layer] = np.dot(self._weights[layer], self._a[layer-1]) \
                + self._biases[layer]
            a = self._actfunc(self._z[layer])
            self._a.append(a)

        # For scoring
        return self._a[-1]

    def _backward_pass(self, y):
        # Initialisation of error matrix
        delta_L = self._dcostfunc(y, self._a[-1]) \
            * self._dactfunc(self._z[-1])
        self._delta = [delta_L]
        for layer in range(1, self._nlayers - 1)[::-1]:
            delta_l = np.dot(
                self._weights[layer + 1].T, self._delta[self._nlayers - layer -2]) \
                * self._dactfunc(self._z[layer])

            self._delta = [delta_l] + self._delta
        self._delta = [np.array([0])] + self._delta

    def _update_gradient(self, batch, n_training):

        n = len(batch)

        # Initialise derivative of cost wrt bias and weights
        dCdb = [np.array([0])] + [np.zeros((size,1)) for size in self._sizes[1:]]
        dCdw = [np.array([0])] + [np.zeros((self._sizes[i], self._sizes[i - 1]))
                                  for i in range(1, self._nlayers)]
        # Loop over batch
        for X, y in batch:
            self._forward_pass(X)
            self._backward_pass(y)

            # Loop over layers
            for layer in range(1, self._nlayers):
                dCdb[layer] += self._delta[layer]/float(n)
                dCdw[layer] += np.dot(self._delta[layer], self._a[layer - 1].T)/float(n) + self._lmbda * self._weights[layer]/float(n_training)

        return dCdb, dCdw

    def fit_SGD(self, training_data, learning_rate, batch_size, epochs, test_data = None):

        n_samples = len(training_data)
        # Loop over epochs
        for ep in range(epochs):

            #Shuffle data
            np.random.shuffle(training_data)

            for k in xrange(0, n_samples, batch_size):
                batch = training_data[k:k+batch_size]

                dCdb, dCdw = self._update_gradient(batch, n_samples)
                # Update bias and weights
                self._biases = [self._biases[layer] - learning_rate * dCdb[layer]
                           for layer in range(self._nlayers)]
                self._weights = [self._weights[layer] - learning_rate * dCdw[layer]
                                 for layer in range(self._nlayers)]
            print "Epoch %s:" %ep, self.score(test_data), "/", len(test_data)

    def score(self, test_data):
        """ Score """

        test_results = [(np.argmax(self._forward_pass(np.reshape(x, (len(x), 1)))), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)



In [13]:

    
# Define the parameters of the neural network
d_NN = {"sizes": [784, 30, 10],
            "actfuncname": "sigmoid",
            "costfuncname": "mse",
            "batch_size": 10,
            "learning_rate": 3,
            "epochs": 30,
            "lambda":0,
            "verbose": True}

# Set the seed for reproducibility
np.random.seed(10)

As advertised above, we use the sigmoid as our activation function and the mean squared error as our cost function. The other important parameters are :

sizes : specifies for each layer the number of unit. Here, we have 3 layers. The input layer has 784 units, the hidden layer has 30 units and the output layer has 10.
batch_size : The number of samples in the mini batch
epochs : The number of training epochs
lambda : The regularisation parameter (this is more advanced, stick to 0)
learning_rate : The learning rate in the gradient descent algorithm. This quantifies how quickly we follow the direction of increasing gradient



In [14]:

    
# Create NN model and fit
NeuralNet = BasicNeuralNet(
        d_NN["sizes"],
        lmbda = d_NN["lambda"],
        actfuncname=d_NN["actfuncname"],
        costfuncname=d_NN["costfuncname"],
        verbose=d_NN["verbose"])

NeuralNet.fit_SGD(
    training_data,
    d_NN["learning_rate"],
    d_NN["batch_size"],
    d_NN["epochs"],
    test_data=test)

# This may take a while to train ...









    



Epoch 0: 9443 / 10000
Epoch 1: 9481 / 10000
Epoch 2: 9550 / 10000
Epoch 3: 9539 / 10000
Epoch 4: 9587 / 10000
Epoch 5: 9592 / 10000
Epoch 6: 9580 / 10000
Epoch 7: 9590 / 10000
Epoch 8: 9572 / 10000
Epoch 9: 9600 / 10000
Epoch 10: 9586 / 10000
Epoch 11: 9561 / 10000
Epoch 12: 9570 / 10000
Epoch 13: 9596 / 10000
Epoch 14: 9565 / 10000
Epoch 15: 9587 / 10000
Epoch 16: 9595 / 10000
Epoch 17: 9572 / 10000
Epoch 18: 9594 / 10000
Epoch 19: 9591 / 10000
Epoch 20: 9590 / 10000
Epoch 21: 9575 / 10000
Epoch 22: 9573 / 10000
Epoch 23: 9599 / 10000
Epoch 24: 9567 / 10000
Epoch 25: 9598 / 10000
Epoch 26: 9577 / 10000
Epoch 27: 9571 / 10000
Epoch 28: 9573 / 10000
Epoch 29: 9581 / 10000

You can now play around with the parameters and see whether you can improve the classification accuracy ! (Change number of layers, number of units, learning_rate ...)